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Abstract 

We develop a hierarchical matrix construction algorithm using matrix- 
vector multiplications, based on the randomized singular value decomposition 
of low-rank matrices. The algorithm uses 0(logn) applications of the matrix 
on structured random test vectors and 0{n\ogn) extra computational cost, 
where n is the dimension of the unknown matrix. Numerical examples on 
constructing Green's functions for elliptic operators in two dimensions show 
efficiency and accuracy of the proposed algorithm. 

Keywords: fast algorithm, hierarchical matrix construction, randomized 
singular value decomposition, matrix-vector multiplication, elliptic 
operator. Green's function 



1. Introduction 

In this work, we consider the following problem: Assume that an unknown 
symmetric matrix G has the structure of a hierarchical matrix (^/-matrix) jl|, 
y, |3|, that is, certain off-diagonal blocks of G are low-rank or approximately 



low-rank (see the definitions in Sections II. 31 and (2. 2p . The task is to construct 
G efficiently only from a "black box" matrix-vector multiplication subroutine 
(which shall be referred to as matvec in the following). In a slightly more 
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general setting when G is not symmetric, the task is to construct G from 
"black box" matrix- vector multiplication subroutines of both G and G^ . In 
this paper, we focus on the case of a symmetric matrix G. The proposed 
algorithm can be extended to the non-symmetric case in a straightforward 
way. 

1.1. Motivation and applications 

Our motivation is mainly the situation that G is given as the Green's func- 
tion of an elliptic equation. In this case, it is proved that G is an "H-matrix 
under mild regularity assumptions 0]. For elliptic equations, methods like 
preconditioned conjugate gradient, geometric and algebraic multigrid meth- 
ods, sparse direct methods provide application of the matrix G on vectors. 
The algorithm proposed in this work then provides an efficient way to con- 
struct the matrix G explicitly in the H-matrix form. 

Once we obtain the matrix G as an H-matrix, it is possible to apply G 
on vectors efficiently, since the application of an H-matrix on a vector is 
linear scaling. Of course, for elliptic equations, it might be more efficient 
to use available fast solvers directly to solve the equation, especially if only 
a few right hand sides are to be solved. However, sometimes, it would be 
advantageous to obtain G since it is then possible to further compress G 
according to the structure of the data (the vectors that G will be acting on), 
for example as in numerical homogenization . Another scenario is that the 
data has special structure like sparsity in the choice of basis, the application 
of the resulting compressed matrix will be more efficient than the "black 
box" elliptic solver. 

Let us remark that, in the case of elliptic equations, it is also possible to 
use the T^-matrix algebra to invert the direct matrix (which is an H-matrix in 
e.g. finite element discretization). Our method, on the other hand, provides 
an efficient alternative algorithm when a fast matrix-vector multiplication 
is readily available. From a computational point of view, what is probably 
more attractive is that our algorithm facilitates a parallelized construction 
of the "H- matrix, while the direct inversion has a sequential nature 

As another motivation, the purpose of the algorithm is to recover the 
matrix via a "black box" matrix-vector multiplication subroutine. A general 
question of this kind will be that under which assumptions of the matrix, 
one can recover the matrix efficiently by matrix-vector multiplications. If 
the unknown matrix is low-rank, the recently developed randomized singular 
value decomposition algorithms (g], 0, Isl provide an efficient way to obtain 
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the low-rank approximation through apphcation of the matrix on random 
vectors. Low-rank matrices play an important role in many applications. 
However, the assumption is too strong in many cases that the whole matrix 
is low-rank. Since the class of "H-matrices is a natural generalization of the 
one of low-rank matrices, the proposed algorithm can be viewed as a further 
step in this direction. 

1.2. Randomized singular value decomposition algorithm 

A repeatedly leveraged tool in the proposed algorithm is the randomized 
singular value decomposition algorithm for computing a low rank approx- 
imation of a given numerically low-rank matrix. This has been an active 
research topic in the past several years with vast literature. For the purpose 
of this work, we have adopted the algorithm developed in (3], although other 
variants of this algorithm with similar ideas can also be used here. For a 
given matrix A that is numerically low-rank, this algorithm goes as following 
to compute a rank-r factorization. 



Algorithm 1 Construct a low-rank approximation A ^ U1MU2 for rank r 



1: 


Choose a Gaussian random matrix Ri € where 


c is a small 




constant; 




2: 


Form ARi and apply SVD to ARi. The first r left sing 


;ular vectors 




give Ui; 




3: 


Choose a Gaussian random matrix R2 G M"-^{'"+'^); 




4: 


Form RJA and apply SVD to A^R2. The first r left sing 


;ular vectors 




give U2; 




5: 


M = {RjUi)^R'^{ARi)]{UjRi)^, where denotes 


the Moore- 




Penrose pseudoinverse of matrix B pp. 257-258]. 





The accuracy of this algorithm and its variants has been studied thor- 
oughly by several groups. If the matrix 2-norm is used to measure the error, 
it is well-known that the best rank-r approximation is provided by the singu- 
lar value decomposition (SVD). When the singular values of A decay rapidly, 
it has been shown that Algorithm IT] results in almost optimal factorizations 
with an overwhelming probability |6|. As Algorithm [1] is to be used frequently 
in our algorithm, we analyze briefly its complexity step by step. The gen- 
eration of random numbers is quite efficient, therefore in practice one may 
ignore the cost of steps 1 and 3. Step 2 takes (r -|- c) matvec of matrix A 
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and 0{n{r + c)^) steps for applying the SVD algorithms on an n x (r + c) 
matrix. The cost of step 4 is the same as the one of step 2. Step 5 involves 
the computation of R2{ARi), which takes 0{n{r + c)^) steps as we have 
already computed ARi in step 2. Once Rj{ARi) is ready, the computation 
of M takes additional 0{{r + c)^) steps. Therefore, the total complexity of 
Algorithm [T] is 0{r + c) matvecs plus 0{n{r + c)^) extra steps. 

1.3. Top-down construction ofH-matrix 

We illustrate the core idea of our algorithm using a simple one-dimensional 
example. The algorithm of constructing a hierarchical matrix G is a top-down 
pass. We assume throughout the article that G is symmetric. 

For clarity, we will first consider a one dimension example. The details of 
the algorithm in two dimensions will be given in Section 2. We assume that 
a symmetric matrix G has a hierarchical low-rank structure corresponding 
to a hierarchical dyadic decomposition of the domain. The matrix G is of 
dimension n x n with n = 2^''' for an integer Lm- Denote the set for all 
indices as Xo;i, where the former subscript indicates the level and the latter 
is the index for blocks in each level. At the first level, the set is partitioned 
into Xi;i and Xi;2, with the assumption that G'(Xi;i,Xi;2) and G'(Xi;2,Xi;i) 
are numerically low-rank, say of rank r for a prescribed error tolerance e. 
At level /, each block X;_i.j on the above level is dyadically decomposed 
into two blocks X/.2i_i and X/.2j with the assumption that G(X;.2i_i,X;.2i) and 
G(Xi;2i,X/;2j-i) are also numerically low-rank (with the same rank r for the 
tolerance e). Clearly, at level /, we have in total 2' off-diagonal low-rank 
blocks. We stop at level Lm-, for which the block X^^^ j only has one index 
{i}. For simphcity of notation, we will abbreviate G(Xj.j,X;.j) by Gi^ij. We 
remark that the assumption that off-diagonal blocks are low-rank matrices 
may not hold for general elliptic operators in higher dimensions. However, 
this assumption simplifies the introduction of the concept of our algorithm. 
More realistic case will be discussed in detail in Sections 12.31 and 12.41 

The overarching strategy of our approach is to peel off the off-diagonal 
blocks level by level and simultaneously construct their low-rank approxima- 
tions. On the first level, Gi;i2 is numerically low-rank. In order to use the 
randomized SVD algorithm for Gi;i2, we need to know the product of Gi;i2 
and also Gj.12 = Gi-2i with a collection of random vectors. This can be done 
by observing that 

fGl-ll Gl:12\ _ / Gl;ll-Rl:l\ 

{G^,2l Gi,22j V ; \Gl;2lRl;lJ ' 
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Gi-u 

Gl-21 



G'l;12 
Gl;22 





^1;: 



G'l;12-Rl;2 
G'l;22-Rl;2 



(2) 



where -Ri;i and Ri-2 are random matrices of dimension n/2 x (r + c). We 
obtain (Gi;2i-Ri;i)"'" = -Ri^iG'i;i2 by restricting the right hand side of Eq. ([T]) 
to Xi;2 and obtain Gi;i2-Ri;2 by restricting the right hand side of Eq. (|2]) to 
Xi;i, respectively. The low-rank approximation using Algorithm [1] results in 



T 

1;21- 



(3) 



?7i;i2 and t/i;2i are n/2 x r matrices and M1.12 is an r x r matrix. Due to the 
fact that G is symmetric, a low-rank approximation of Gi-2i is obtained as 
the transpose of Gi-u. 

Now on the second level, the matrix G has the form 



/ G2-11 
G2-21 



G2-12 
^2:22 



v 



Gi 



21 



G^2;33 
G2-A3 



G ^ 

G^2;34 
G^2;44/ 



The submatrices G2;i2, G'2;2i, G'2;34, and G'2;43 are numerically low-rank, to 
obtain their low-rank approximations by the randomized SVD algorithm. 
Similar to the first level, we could apply G on random matrices of the form 
like {R2-1, 0, 0, 0)""". This will require 4(r-|-c) number of matrix- vector multipli- 
cations. However, this is not optimal: Since we already know the interaction 
between Zi-i and Xi;2, we could combine the calculations together to reduce 
the number of matrix-vector multiplications needed. Observe that 



/ G2;ll 
G2-21 



G2-12 
^2:22 



V 



Gl; 



21 



G'2;33 
G^2;43 



G ^ 

G'2;34 
G2-4A/ 



/i?2;l\ 


^2,3 

\ / 



G'2;ll-R2;l 
G2;2lR2;l 
G^2;33-R2;3 
G^2;43-R2;3 



Denote 





with (ji;i2 and G1.21 the low-rank approximations we constructed on the first 
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level, then 





\ / 



Gi-u 

Gl-2i 



R2;3 






(6) 



Therefore, 



(G - GW) 



/^2;l\ 



-R2;3 

V / 

so that we simultaneously obtain (G2;2i-R2;] 



/ G2;ll-R2;l\ 
G2;21-R2;l 
G2;33-R2;3 

yG2;43-R2;3/ 



(7) 



-R2;1^2;12 and (G2;43-R2;3 



-R2-3^2;34- Similarly, applying G on (0, -R2;25 0, R2-4) provides G2;i2-R2;2 and 
G2;34-R2;4- We cau then obtain the following low-rank approximations by 
invoking Algorithm [H 



G 
G 



2:12 



2:34 



G2;12 
^2:34 



U2:uM2-uU., 



2:21) 



(8) 



2:43- 



The low-rank approximations of G2;2i and G2;43 are again given by the trans- 
poses of the above formulas. 

Similarly, on the third level, the matrix G has the form 



/ Gs;!! G3:i2 
G'3;21 G3:22 



G 



2:21 



G 

G3;33 
G3;43 



2;12 

G3;34 
G3;44 



\ 



Gl: 



21 



G^3;55 
G^3;65 

G 



Gl; 

G^3;56 
G^3;66 

2:43 
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and define 



G(2) 



G2;21 

V 



G2;12 











G 

G2;43 



G'3;77 
G3;87 

\ 



G2;34 

G'3;78 

Gs-ss 



(9) 



2;34 

/ 



(10) 
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We could simultaneously obtain the product of 03-12, G^-s^, G^.^q and G3;78 
with random vectors by applying the matrix G with random vectors of the 
form 

{rJ.^, 0, rJ.^, 0, Rj.r^, 0, RJ.j, 0)"^, 

then subtract the product of G'^^^+G'^^^ with the same vectors. Again invoking 
Algorithm [1] provides us the low-rank approximations of these off-diagonal 
blocks. 

The algorithm continues in the same fashion for higher levels. The com- 
bined random tests lead to a constant number of matvec at each level. As 
there are log(n) levels in total, the total number of matrix-vector multiplica- 
tions scales logarithmically. 

When the block size on a level becomes smaller than the given criteria (for 
example, the numerical rank r used in the construction), one could switch to 
a deterministic way to get the off-diagonal blocks. In particular, we stop at 
a level L [L < Lm) such that each contains about r entries. Now only 
the elements in the diagonal blocks need to be determined. This can 
be completed by applying G to the matrix 

(/,/,..., /f, 

where / is the identity matrix whose dimension is equal to the number of 
indices in Xx,.j. 

Let us summarize the structure of our algorithm. From the top level to the 
bottom level, we peel off the numerically low-rank off-diagonal blocks using 
the randomized SVD algorithm. The matrix-vector multiplications required 
by the randomized SVD algorithms are computed effectively by combining 
several random tests into one using the zero pattern of the remaining matrix. 
In this way, we get an efficient algorithm for constructing the hierarchical 
representation for the matrix G. 

1.4- Related works 

Our algorithm is built on top of the framework of the ?^-matrices pro- 
posed by Hackbusch and his collaborators [H, Q, 0]- The definitions of the 
"H-matrices will be summarized in Section [2l In a nutshell, the "H-matrix 
framework is an operational matrix algebra for efficiently representing, ap- 
plying, and manipulating discretizations of operators from elliptic partial 
differential equations. Though we have known how to represent and ap- 



ply these matrices for quite some time [10|, it is the contribution of the 
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"H-matrix framework that enables one to manipulate them in a general and 
coherent way. A closely related matrix algebra is also developed in a more 
numerical-linear-algebraic viewpoint under the name hierarchical semisepa- 
rable matrices by Chandrasekaran, Gu, and others ll|, [l^. Here, we will 
follow the notations of the ?^-matrices as our main motivations are from 
numerical solutions of elliptic PDEs. 

A basic assumption of our algorithm is the existence of a fast matrix- 
vector multiplication subroutine. The most common case is when G is the 
inverse of the stiffness matrix if of a general elliptic operator. Since H is 
often sparse, much effort has been devoted to computing u = Gf by solving 
the linear system Hu = f. Many ingenious algorithms have been developed 
for this purpose in the past forty years. Commonly-seen^ examples include 
multifrontal algorithms 



13, 14 



geometric multigrids 
domain decompositions methods [1 
20| and preconditioned conjugate gradient algorithms 




algebraic 
wavelet- 



Very recently, both Chandrasekaran et al [22 



multigrids (AMG) pj 
based fast algorithms 
(PCG) 0, to name a few 
and Martinsson 23|] have combined the idea of the multifrontal algorithms 
with the "H-matrices to obtain highly efficiently direct solvers for Hu = f. 
Another common case for which a fast matrix-vector multiplication subrou- 
tine is available comes from the boundary integral equations where G is often 
a discretization of a Green's function restricted to a domain boundary. Fast 
algorithms developed for this case include the famous fast multipole method 



10l |. the panel clustering method [2J], and others. All these fast algorithms 
mentioned above can be used as the "black box" algorithm for our method. 

As shown in the previous section, our algorithm relies heavily on the 
randomized singular value decomposition algorithm for constructing the fac- 
torizations of the off-diagonal blocks. This topic has been a highly active 
research area in the past several years and many different algorithms have 
been proposed in the literature. Here, for our purpose, we have adopted 
the algorithm described in 0, [sj. In a related but slightly different problem, 
the goal is to find low-rank approximations A = CUR where C contains a 
subset of columns of A and R contains a subset of rows. Papers devoted to 



this task include 25, 26, 27, 28 



In our setting, since we assume no direct 
access of entries of the matrix A but only its impact through matrix-vector 
multiplications, the algorithm proposed by 0] is the most relevant choice. 
An excellent recent review of this fast growing field can be found in joj . 
In a recent paper 29|, Martinsson considered also the problem of con- 



structing the "H-matrix representation of a matrix, but he assumed that one 
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can access arbitrary entries of the matrix besides the fast matrix- vector mul- 
tiphcation subroutine. Under this extra assumption, he showed that one 
can construct the 'H? representation of the matrix with 0{1) matrix- vector 
muhiphcations and accesses of 0{n) matrix entries. However, in many situ- 
ations including the case of G being the inverse of the stiffness matrix of an 
eUiptic differential operator, accessing entries of G is by no means a trivial 
task. Comparing with Martinsson's work, our algorithm only assumes the 
existence of a fast matrix-vector multiplication subroutine, and hence is more 
general. 

As we mentioned earlier, one motivation for computing G explicitly is to 
further compress the matrix G. The most common example in the literature 
of numerical analysis is the process of numerical homogenization or upscaling 
jij . Here the matrix G is often again the inverse of the stiffness matrix H of an 
elliptic partial differential operator. When H contains information from all 
scales, the standard homogenization techniques fail. Recently, Owhadi and 
Zhang 30| proposed an elegant method that, under the assumption that the 
Cordes condition is satisfied, upscales a general H in divergence form using 
metric transformation. Computationally, their approach involves d solves of 
form Hu = f with d being the dimension of the problem. On the other hand, 
if G is computed using our algorithm, one can obtain the upscaled operator 
by inverting a low-passed and down-sampled version of G. Complexity- wise, 
our algorithm is more costly since it requires 0{logn) solves of Hu = f. 
However, since our approach makes no analytic assumptions about H, it is 
expected to be more general. 



2. Algorithm 

We now present the details of our algorithm in two dimensions. In ad- 
dition to a top-down construction using the peeling idea presented in the 
introduction, the complexity will be further reduced using the "H^ property 
of the matrix [l], [sj . The extension to three dimensions is straightforward. 

In two dimensions, a more conservative partition of the domain is re- 
quired to guarantee the low-rankness of the matrix blocks. We will start 
with discussion of this new geometric setup. Then we will recall the notion 
of hierarchical matrices and related algorithms in Section [221 The algorithm 
to construct an T-L^ representation for a matrix using matrix-vector multi- 
plications will be presented in Sections 12.31 and 12.41 Finally, variants of the 
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algorithm for constructing the "H^ and uniform "H^ representations will be 
described in Section 1^31 

2.1. Geometric setup and notations 

Let us consider an operator G defined on a 2D domain [0, 1)^ with periodic 
boundary condition. We discretize the problem using a.nn = N x N uniform 
grid with being a power of 2: = 2^'^' . Denote the set of all grid points 

as 

Jo = {{h/N, k2/N) \kuk2eN,0<h,k2< N} (11) 

and partition the domain hierarchically into L + 1 levels {L < L^)- On each 
level / (0 < / < L), we have 2' x 2' boxes denoted by I^j = [{i - l)/2\i/f) x 
[(j — l)/2', j'/2') for 1 < j < 2'. The symbol X;;jj will also be used to denote 
the grid points that lies in the box X^ij. The meaning should be clear from 
the context. We will also use X/(or Ji) to denote a general box on certain 
level /. The subscript / will be omitted, when the level is clear from the 
context. For a given box Xi for / > 1, we call a box Ji-i on level / — 1 its 
parent if X; C Ji-i. Naturally, Xi is called a child of Ji^i. It is clear that 
each box except those on level L will have four children boxes. 

For any box X on level /, it covers x 7V/2' grid points. The last level 
L can be chosen so that the leaf box has a constant number of points in it 
{i.e. the difference Lm — L is kept to be a constant when increases). 

For simplicity of presentation, we will start the method from level 3. It 
is also possible to start from level 2. Level 2 needs to be treated specially, as 
for level 3. We define the following notations for a box X on level Z (/ > 3): 

NL(X) Neighbor list of box X. This list contains the boxes on level I that are 
adjacent to X and also X itself. There are 9 boxes in the list for each 
X. 

IL(X) Interaction list of box X. When / = 3, this list contains all the boxes 
on level 3 minus the set of boxes in NL(X). There are 55 boxes in total. 
When / > 3, this list contains all the boxes on level / that are children 
of boxes in NL('P) with V being X's parent minus the set of boxes in 
NL(X). There are 27 such boxes. 

Notice that these two lists determine two symmetric relationship: J G NL(X) 
if and only if X G NL(J) and J G IL(X) if and only if X G IL(J^). Figs. □ 
and [2] illustrate the computational domain and the lists for / = 3 and / = 4, 
respectively. 
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Box J3.3_3 


■ 


Adjacent 


■ 


Interaction 




Figure 1: Illustration of the computational domain at level 3. 23;3,3 is the black box. The 
neighbor list N 1(23:3^3) consists of 8 adjacent light gray boxes and the black box itself, 
and the interaction list IL(23;3^3) consists of the 55 dark gray boxes. 



■ 


Box X4;5^5 


m 


Adjacent 


■ 


Interaction 




Figure 2: Illustration of the computational domain at level 4. 14-5^5 is the black box. The 
neighbor list N 1(1455^5) consists of 8 adjacent light gray boxes and the black box itself, 
and the interaction list IL(l4;5_5) consists of the 27 dark gray boxes. 
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For a vector / defined on the N x N grid Xq, we define /(X) to be the 
restriction of / to grid points X. For a matrix G G IR^^^^^ that represents a 
linear map from Xq to itself, we define G{I, J') to be the restriction of G on 
IxJ. 

A matrix G G has the following decomposition 

G = + G(') + ■ ■ ■ + G(^) + D(^). (12) 

Here, for each /, G^''^ incorporates the interaction on level / between a box 
with its interaction list. More precisely, G^'-* has a 2^' x 2^' block structure: 

[0, otherwise 

with X and both on level /. The matrix D^^^ includes the interactions 
between adjacent boxes at level L: 

) 0, otherwise 

with X and both on level L. To show that (1T2|) is true, it suffices to prove 
that for any two boxes X and JT" on level L, the right hand side gives G(X, ^7). 
In the case that X G NL(j7'), this is obvious. Otherwise, it is clear that we can 
find a level /, and boxes X' and JT"' on level /, such that X' G \L{J''), X C X' 
and i7 C J'', and hence (^(X, J') is given through G(X', J''). Throughout the 
text, we will use \\A\\2 to denote the matrix 2-norm of matrix A. 

2.2. Hierarchical matrix 

Our algorithm works with the so-called hierarchical matrices. We recall 
in this subsection some basic properties of this type of matrices and also 
some related algorithms. For simplicity of notations and representation, we 
will only work with symmetric matrices. For a more detailed introduction of 
the hierarchical matrices and their applications in fast algorithms, we refer 
the readers to (2!, Isj . 

2.2.1. matrices 

Definition 1. G is a (symmetric) "H^-matrix if for any e > 0, there exists 
r{e) < log(£:^^) such that for any pair (X, JT") with X G IL(j7'), there exist 
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orthogonal matrices Uxj and U jx with r{e) columns and matrix Mxj G 

\\G{X,J) - UxjMxjUj^h < e\\G{X,J)h. (13) 

The main advantage of the "H^ matrix is that the application of such 
matrix on a vector can be efficiently evaluated: Within error 0{e), one 
can use G(X, J') = UxjMxjUjx, which is low-rank, instead of the origi- 
nal block G{X, J'). The algorithm is described in Algorithm [2l It is standard 
that the complexity of the matrix-vector multiplication for an "H^ matrix is 
0{NHogN) g. 

Algorithm 2 Application of a "H^-matrix G on a vector /. 
1: u = 0; 

2: for / = 3 to L do 

3: for I on level I do 

4: for J G IL(I) do 

5: u{I) = u{X) + UxAMxj{U}^f{J))); 

6: end for 

7: end for 

8: end for 

9: for X on level L do 
10: for J G NL(Z) do 
11: u{X)=u{l) + G{X,J)f{J)- 

12: end for 
13: end for 



2.2.2. Uniform 'H} matrix 

Definition 2. G is a (symmetric) uniform "H^-matrix if for any e > 0, there 
exists ru{e) < log(£:~^) such that for each box X, there exists an orthogonal 
matrix Ux with ru{e) columns such that for any pair (X, J) with X G \\-{J) 

\\G{X,J)-UxNxjU}h<e\\G{X,J)\\^ (14) 

with Nxj e M^t/(e)xry(e)^ 

The application of a uniform 1-0^ matrix to a vector is described in Algo- 
rithm [31 The complexity of the algorithm is still 0{N'^ log N). However, the 
prefactor is much better as each Ux is applied only once. The speedup over 
Algorithm [2] is roughly 21r{e) /ru{e) 
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Algorithm 3 Application of a uniform "H^-matrix G on a vector / 



1 

± 


11 — n- 

Li U, 


1 


for / — 3 to L do 


o 
z 


for / = 3 to iv do 


Id 


for X on level I do 


Q 

o 


for on level I do 


17 


U[l) = U[l) + UxUx, 


4 


fj = UjfiJ); 


18 


end for 


5 


end for 


19 


end for 


6 


end for 


20 


for Z on level L do 


7 


for / = 3 to L do 


21 


for J G NL(Z) do 


8 


for Z on level I do 


22 


u{Z) = u{Z) + 


9 


Ux = 0; 




G{Z,J)f{J)- 


10 


for J G IL(Z) do 


23 


end for 


11 


ux = ui + Nxjfj; 


24 


end for 


12 


end for 






13 


end for 






14 


end for 







2.2.3. matrices 

Definition 3. G is an "H^ matrix if 

• it is a uniform "H^ matrix; 

• Suppose tliat C is any cliild of a box X, tlien 

\\UxiC,:)-UcTcxh<e, (15) 

for some matrix Tex E M^-u(^)x'-c/{e). 

The application of an "H^ matrix to a vector is described in Algorithm |4] 
and it has a complexity of (!?(A^), Notice that, compared with "H^ matrix, 
the logarithmic factor is reduced |3|. 

Remark. Applying an "H^ matrix to a vector can indeed be viewed as the 
matrix form of the fast multipole method (FMM) [lo|. One recognizes in 
Algorithm H] that the second top-level for loop corresponds to the M2M 
(multipole expansion to multipole expansion) translations of the FMM; the 
third top-level for loop is the M2L (multipole expansion to local expansion) 
translations; and the fourth top-level for loop is the L2L (local expansion to 
local expansion) translations. 

In the algorithm to be introduced, we will also need to apply a partial 
matrix G^^) + G^^^ + ■■■ + G^^'^ for some L' < L to a vector /. This amounts 
to a variant of Algorithm HI described in Algorithm O 
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Algorithm 4 Application of a "H^-matrix G on a vector / 
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for X on level L do 
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for J € IL(Z) do 
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2. 3. Peeling algorithm: outline and preparation 

We assume that G is a symmetric "H^ matrix and that there exists a fast 
matrix- vector subroutine for applying G to any vector / as a "black box". 
The goal is to construct an "H^ representation of the matrix G using only a 
small number of test vectors. 

The basic strategy is a top-down construction: For each level / = 3, . . . , L, 
assume that an T-L^ representation for G^^^-l-- • ■ + is given, we construct 

G^''^ by the following three steps: 

1. Peeling. Construct an "H^ representation for G*^'^ using the peeling idea 
and the "H^ representation for G^^^ -|- ■ ■ ■ -|- G*-'"^-*. 

2. Uniformization. Construct a uniform "H^ representation for G^'^ from 
its representation. 

3. Projection. Construct an representation for G^'^^ + ■ • • + G^'^ 

The names of these steps will be made clear in the following discussion. 
Variants of the algorithm that only construct an Ti^ representation (a uniform 
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Algorithm 5 Application of a partial "H^-matrix + - ■ ■ + G^^''^ on a vector 
/ 
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for each child C of X do 
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Ti^ representation, respectively) of the matrix G can be obtained by only 
doing the peeling step (the peeling and uniformization steps, respectively). 
These variants will be discussed in Section 12.51 

After we have the T-L^ representation for G*-^^ + ■ ■ ■ + G^^\ we use the 
peeling idea again to extract the diagonal part D^^\ We call this whole 
process the peeling algorithm. 

Before detailing the peeling algorithm, we mention two procedures that 
serve as essential components of our algorithm. The first procedure con- 
cerns with the uniformization step, in which one needs to get a uniform 
T-L^ representation for G^''^ from its representation, i.e., from G{X,J') = 
UxjMxjUjx to G(X, J) = UxNxjUj, for all pairs of boxes (J, J) with 
X G IL(i7). To this end, what we need to do is to find the column space of 

[Uxj.Mxj, I Uxj.Mxj, I ■ ■ ■ I Uxj.Mxj,], (16) 

where J'j are the boxes in IL(X) and t = |IL(X)|. Notice that we weight the 
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singular vectors U by M, so that the singular vectors corresponding to larger 
singular values will be more significant. This column space can be found by 
the usual SVD algorithm or a more effective randomized version presented in 
Algorithm ini The important left singular vectors are denoted by Uj, and the 
diagonal matrix formed by the singular values associated with Ux is denoted 
by Sx- 

Algorithm 6 Construct a uniform "H^ representation of G from the Ti^ 
representation at a level / 
1: for each box X on level / do 

2: Generate a Gaussian random matrix R e m(''(^)^*)>^('''^(^)+^); 

3: Form product [Uxj^Mxj^ \ ■■■ \ UxjtMxjt]R and apply SVD to it. 

The first ru{s) left singular vectors give Ux, and the corresponding 

singular values give a diagonal matrix Sx', 
4: for Jj do 
5: ljj^=u:[Uxj^; 
6: end for 
7: end for 

8: for each pair (X, J') on level I with I G IL(j7) do 

9: Nxj = IxjMxjiJx; 
10: end for 



Complexity analysis: For a box X on level /, the number of grid points 
in I is {N/2^f. Therefore, Uxj^ are all of size {N/2^f x r{e) and Mxj 
are of size r(e) x r{e). Forming the product [Uxj-i^Mxj-^ I ' ' ' I UxjtMxjt]R 
takes 0{{N/2^yr{e){ruie) + c)) steps and SVD takes 0{{N/2^y{ruie)+cf) 
steps. As there are 2^' boxes on level /, the overall cost of Algorithm [H] 
is 0{N^{ruie) + cf) = 0{N^). One may also apply to [Uxj.Mxj, \ ■■■ \ 
UxjfMxjt] the deterministic SVD algorithm, which has the same order of 
complexity but with a prefactor about 27r{e)/ {ru{£) + c) times larger. 

The second procedure is concerned with the projection step of the above 
list, in which one constructs an "H^ representation for G^^^ + ■ ■ ■ G*'-*. Here, we 
are given the representation for G^^^ + ■ ■ ■ + G*-'"^-' along with the uniform 

representation for G^''^ and the goal is to compute the transfer matrix Tex 
for a box X on level / — 1 and its child C on level / such that 

\\Ux{C,:)-UcTcx\\2<e. 
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In fact, the existing Uc matrix of the uniform "H^ representation may not be 
rich enough to contain the columns of Ux{C, :) in its span. Therefore, one 
needs to update the content of Uc as well. To do that, we perform a singular 
value decomposition for the combined matrix 

[Ux{C,:)Si\UcSc] 

and define a matrix Vc to contain ru{e) left singular vectors. Again Ux,Uc 
should be weighted by the corresponding singular values. The transfer matrix 
Tci is then given by 

Tci = VjUxiC :) 

and the new Uc is set to be equal to Vc- Since Uc has been changed, the 
matrices Ncv for V G IL(C) and also the corresponding singular values Sc 
need to be updated as well. The details are listed in Algorithm [71 

Algorithm 7 Construct an "H^ representation of G from the uniform "H^ 
representation at level / 

1: for each box X on level / — 1 do 

2: for each child C of I do 

3: Form matrix [Ui{C,:)Sx \ UcSc] and apply SVD to it. The 
first ru[e) left singular vectors give Vc, and the corresponding 
singular values give a diagonal matrix Wc] 

4: Kc = V^Uc] 

5: TcT = V^Ux{e,:)- 

6: Uc = Vc; 

7: Sc = Wc\ 

8: end for 
9: end for 

10: for each pair (C,!?) on level I with C G IL(^^) do 
11: Ncv = KcNcvKl; 
12: end for 



Complexity analysis: The main computational task of Algorithm [7] is 
again the SVD part. For a box C on level /, the number of grid points in 
X is (iV/2')^. Therefore, the combined matrix [f/j(C, ■.)Sx \ UcSc] is of size 
{N/2^f X 2ru{e). The SVD computation clearly takes 0{{N /2^fru{ef) = 
0{{N/2'-)'^) steps. Taking into the consideration that there are 2^' boxes on 
level I gives rise to an (9(A^^) estimate for the cost of Algorithm [71 
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2.4- Peeling algorithm: details 

With the above preparation, we are now ready to describe the peehng 
algorithm in detail at different levels, starting from level 3. At each level, we 
follow exactly the three steps listed at the beginning of Section 12.31 

24.1. Levels 

First in the peeling step, we construct the representation for G^^\ For 
each pair iX^J) on level 3 such that X G \\-{J), we will invoke randomized 
SVD Algorithm [1] to construct the low rank approximation of Gx,j- How- 
ever, in order to apply the algorithm we need to compute G{X, J)Rj and 
RlG{X,J), where Rx and Rj are random matrices with r{e) + c columns. 
For each box J on level 3, we construct a matrix R of size A^^ x {r{e) + c) 
such that 

R{J,:) = Rj and i?(Xo\ J, :) = 0. 

Computing GR using r(e) +c matvecs and restricting the result to grid points 
X G \liJ) gives G{I, J)Rj for each X G IL(J^). 

After repeating these steps for all boxes on level 3, we hold for any pair 
(X, J) with X G \\-{J) the following data: 

G{X,J)Rj and RIG{X, J) = {G{J ,X)Rx)'^ . 

Now, applying Algorithm [1] to them gives the low-rank approximation 

G{X,J) = UxjMxjU}^. (17) 

In the uniformization step, in order to get the uniform 'h} representation 
for G'^'^\ we simply apply Algorithm [6] to the boxes on level 3 to get the 
approximations 

GiX,J) = UxNxjU}. (18) 

Finally in the projection step, since we only have 1 level now (level 3), 
we have already the "H^ representation for G'^^\ 

Complexity analysis: The dominant computation is the construction of 
the IH} representation for G^^\ This requires r{e) + c matvecs for each box 
X on level 3. Since there are in total 64 boxes at this level, the total cost 
is 64(r(£) -f c) matvecs. From the complexity analysis in Section 12. 3^ the 
computation for the second and third steps cost an extra Oi^N"^) steps. 
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2.4-2. Level A 

First in the peehng step, in order to construct the 'h} representation for 
G^^\ we need to compute the matrices G{X, J)Rj and E^G{X, J) for each 
pair i^^J^ on level 4 with X G IL(jr). Here Rx and Rj are again random 
matrices with r{£) + c columns. 

One approach is to follow exactly what we did for level 3: Fix a box J 
at this level, construct R of size A^^ x {r[e) + c) such that 

R{J, :) = Rj and R{X^\J , :) = 0. 

Next apply G — G^^'^ to R, by subtracting GR and G^^^R. The former is 
computed using r(e) + c matvecs and the latter is done by Algorithm [51 
Finally, restrict the result to grid points X G 11(^7") gives G{I, J)Rj for each 

xg iL(:r). 

However, we have observed in the simple one- dimensional example in 
Section 11.31 that random tests can be combined together as in Eq. ([S]) and 
dZj). We shall detail this observation in the more general situation here as 
following. Observe that G-G^^) _ g(4) + /^(4)^ G^'^\j ,X) and D^^\j ,T) 
for boxes X and J on level 4 is only nonzero if X G HL{J) VJ\\-{J). Therefore, 
{G — G^^^)R for R coming from J can only be nonzero in NL(P) with V 
being jT's parent. The rest is automatically zero (up to error e as 

G(3) 

is approximated by its IH? representation). Therefore, we can combine the 
calculation of different boxes as long as their non-zero regions do not overlap. 
More precisely, we introduce the following sets Spq for 1 < p, g < 8 with 

Spq = {Ji-ij \ i = p (mod 8), j = q (mod 8)}. (19) 

There are 64 sets in total, each consisting of four boxes. Fig. [3] illustrates 
one such set at level 4. For each set Spq, first construct R with 

R{J ■) = \ ^ ^ 

)0, otherwise. 

Then, we apply G — G^^^ to /?, by subtracting GR and G^^^R. The former 
is computed using r{e) + c matvecs and the latter is done by Algorithm [51 
For each J' G Spq, restricting the result to X G 11(^7") gives G{X, J)Rj. 
Repeating this computation for all sets Spq then provides us with the following 
data: 

G{X,J)Rj and RIG{X, J) = {G{J ,X)Rxf , 
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for each pair iX^J) with X G \\-{J). Applying Algorithm [T] to them gives 
the required low-rank approximations 

G{I,J) = UxjMjjU}^ (20) 

with Uxj orthogonal. 




Figure 3: Illustration of the set S55 at level 4. This set consists of four black boxes 
{X4;5^5, 24^13^5, 14:5_i3,l4;i3^i3}. The light gray boxes around each black box are in the 
neighbor list and the dark gray boxes in the interaction list. 

Next in the uniformization step, the task is to construct the uniform "H^ 
representation of G^^\ Similar to the computation at level 3, we simply apply 
Algorithm [6] to the boxes on level 4 to get 

G{I,J) = UjNxjU}. (21) 

Finally in the projection step, to get "H^ representation for G^^^ +G^'^\ we 
invoke Algorithm [7] at level 4. Once it is done, we hold the transfer matrices 
Tex between any X on level 3 and each of its children C, along with the 
updated uniform "H^-matrix representation of G^^\ 

Complexity analysis: The dominant computation is again the construc- 
tion of "H^ representation for G^^\ For each group Spq, we apply G to r{e) + c 
vectors and apply G^'^^ to r{e) + c vectors. The latter takes 0{N'^) steps for 
each application. Since there are 64 sets in total, this computation takes 
64(r(£) + c) matvecs and 0{N'^) extra steps. 
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2.4-3. Level I 

First in the peehng step, to construct the representation for G*-'-*, we 
follow the discussion of level 4. Define 64 sets Spq for 1 < p,q < 8 with 

Spq = {Ji;ij \i = p (mod 8), j = q (mod 8)}. (22) 

Each set contains exactly 2'/8 x 2'/8 boxes. For each set Spq, construct R 
with 

lo, otherwise. 

Next, apply G-[G^'^^ + -- ■ + G^^'^'>] to R, by subtracting GR and + . ■ ■ + 
G^''~^^]R. The former is again computed using r{e) + c matvecs and the latter 
is done by Algorithm [5] using the representation of G^'^'^ + ■ ■ ■ + G^^~^\ 
For each J' G Spq, restricting the result to X G IL(^7) gives G{X, J)Rj. 
Repeating this computation for all sets Spq gives the following data for any 
pair {X,J) withXG \\-{J) 

G{X,J)Rj and RIG{X, J) = {G{J ,X)Rxf . 

Now applying Algorithm [1] to them gives the low-rank approximation 

G{X,J) = UxjMxjU}^ (23) 

with Uxj orthogonal. 

Similar to the computation at level 4, the uniformization step that con- 
structs the uniform "H^ representation of G*^'^ simply by Algorithm |6] to the 
boxes on level /. The result gives the approximation 

G{X,J) = UxNxjU}. (24) 

Finally in the projection step, one needs to compute an 'h? representation 
for G^^^ + ■ ■ ■ + G'^^y To this end, we apply Algorithm [7] to level /. 

The complexity analysis at level / follows exactly the one of level 4. Since 
we still have exactly 64 sets Spq, the computation again takes 64(r(£:) + c) 
matvecs along with 0{N'^) extra steps. 

These three steps (peeling, uniformization, and projection) are repeated 
for each level until we reach level L. At this point, we hold the 'H? represen- 
tation for + 
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2.4-4- Computation of D^^^ 

Finally we construct of the diagonal part 

= G'-(G(3) + --- + G(^)). (25) 

More specifically, for each box J' on level L, we need to compute G{I, J') 
forXG Nl{J). 

Define a matrix E of size A^^ x {N/2^Y (recall that the box J on level 
L covers [N/2^Y grid points) by 

E{J,:) = I and E (Xo\J , :) = , 

where / is the x {N/2^Y identity matrix. Applying G - (G^^) + 

■ ■ ■ + G^^)) to E and restricting the results to X G NL(^7) gives G(X, J) for 
X G NL(jr). However, we can do better as (G - (G^^) + ■ ■ ■ + G^^^))E is only 
non-zero in NL{J'). Hence, one can combine computation of different boxes 
JT" as long as NL(j7) do not overlap. 

To do this, define the following 4 x 4 = 16 sets Spg, 1 < p, 5' < 4 

Spg = {Ji-ij \i^p (mod 4), j = q (mod 4)}. 

For each set Spg, construct matrix E by 

E{J ■) = l^' ^ ^ 

lo, otherwise. 

Next, apply G - (G^^^ H h G^^^) to E. For each J G restricting the 

result to X G NL(^7) gives G(X, J^l = G(X, J'). Repeating this computation 
for all 16 sets Spg gives the diagonal part D^^\ 

Complexity analysis: The dominant computation is for each group Spg 

apply G and G^^^ H hG*^^) to E, the former takes {N/2^y matvecs and the 

latter takes 0{{N/2^)'^N^) extra steps. Recall by the choice of L, N/2^ is a 
constant. Therefore, the total cost for 16 sets is 16(A^/2'^)^ = 0{1) matvecs 
and C(A^^) extra steps. 

Let us now summarize the complexity of the whole peeling algorithm. 
From the above discussion, it is clear that at each level the algorithm spends 
64(r(e) + c) = 0{1) matvecs and 0{N'^) extra steps. As there are O(logA^) 
levels, the overall cost of the peeling algorithm is equal to C(log A^) matvecs 
plus C»(A^^logA^) steps. 
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It is a natural concern that whether the error from low-rank decompo- 
sitions on top levels accumulates in the peeling steps. As observed from 
numerical examples in Section |3l it does not seem to be a problem at least 
for the examples considered. We do not have a proof for this though. 

2.5. Peeling algorithm: variants 

In this section, we discuss two variants of the peeling algorithm. Let us 
recall that the above algorithm performs the following three steps at each 
level /. 

1. Peeling. Construct an "H^ representation for G*^') using the peeling idea 
and the "H^ representation for G^^^ + • • • + G^''~^\ 

2. Uniformization. Construct a uniform Ti^ representation for G*-'-* from 
its "H^ representation. 

3. Projection. Construct an representation for G^^^ + ■ ■ • + G^''\ 

As this algorithm constructs the representation of the matrix G, we also 
refer to it more specifically as the version of the peeling algorithm. In 
what follows, we list two simpler versions that are useful in practice 

• the "H^ version, and 

• the uniform Ti^ version. 

In the "H^ version, we only perform the peeling step at each level. Since 
this version constructs only the representation, it will use the repre- 
sentation of G^^^ + ■ ■ ■ + G^''' in the computation of {G^^^ -f--- + G('))i? within 
the peeling step at level / + 1. 

In the uniform T-L^ version, we perform the peeling step and the uni- 
formization step at each level. This will give us instead the uniform "H^ 
version of the matrix. Accordingly, one needs to use the uniform "H^ repre- 
sentation of G^^^ + ■ ■ ■ + G*^'-* in the computation of (G*-^^ + --- + G('))/? within 
the peeling step at level / + 1. 

These two simplified versions are of practical value since there are ma- 
trices that are in the Ti^ or the uniform T-L^ class but not the class. A 
simple calculation shows that these two simplified versions still take 0{\ogN) 
matvecs but requires 0{N'^log'^ N) extra steps. Clearly, the number of extra 
steps is log times more expensive than the one of the T-L^ version. However, 
if the fast matrix-vector multiplication subroutine itself takes 0{N'^ log N) 
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steps per application, using the "H^ or the uniform "H^ version does not change 
the overall asymptotic complexity. 

Between these two simplified versions, the uniform Ti^ version requires the 
uniformization step, while the "H^ version does not. This seems to suggest 
that the uniform version is more expensive. However, because (1) our 
algorithm also utilizes the partially constructed representations for the cal- 
culation at future levels and (2) the uniform Ti^ representation is much faster 
to apply, the construction of the uniform "H^ version turns out to be much 
faster. Moreover, since the uniform "H^ representation stores one Ux matrix 
for each box X while the "H^ version stores about 27 of them, the uniform 
"H^ is much more memory-efficient, which is very important for problems in 
higher dimensions. 



3. Numerical results 

We study the performance of the hierarchical matrix construction algo- 
rithm for the inverse of a discretized elliptic operator. The computational 
domain is a two dimensional square [0, 1)^ with periodic boundary condition, 
discretized as an x equispaced grid. We first consider the operator 
H = —A + V with A being the discretized Laplacian operator and the po- 
tential being V{i,j) = 1 + W{i, j), i,j = 1, . . . , N. For all W{i,j) are 
independent random numbers uniformly distributed in [0, 1]. The potential 
function V is chosen to have this strong randomness in order to show that the 
existence of "H-matrix representation of the Green's function depends weakly 
on the smoothness of the potential. The inverse matrix of H is denoted by 
G. The algorithms are implemented using MATLAB. All numerical tests are 
carried out on a single-CPU machine. 

We analyze the performance statistics by examining both the cost and 
the accuracy of our algorithm. The cost factors include the time cost and 
the memory cost. While the memory cost is mainly determined by how 
the matrix G is compressed and does not depend much on the particular 
implementation, the time cost depends heavily on the performance of matvec 
subroutine. Therefore, we report both the wall clock time consumption of 
the algorithm and the number of calls to the matvec subroutine. The matvec 
subroutine used here is a nested dissection reordered block Gauss elimination 



method [ij]. For an x discretization of the computational domain, this 
matvec subroutine has a computational cost of (9(A^^logA^) steps. 
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Table [T] summarizes the matvec number, and the time cost per degree of 
freedom (DOF) for the "H^, the uniform "H^ and the 'H? representations of 
the peehng algorithm. The time cost per DOF is defined by the total time 
cost divided by the number of grid points A^^. For the l-L^ and the uniform 
'H} versions, the error criterion e in Eq. ([T3|l . Eq. ([T^ and Eq. ( 1T5|) are all 
set to be 10"^ 

The number of calls to the matvec subroutine is the same in all three cases 
(as the peeling step is the same for all cases) and is reported in the third 
column of Tabledl It is confirmed that the number of calls to matvec increases 
logarithmically with respect to if the domain size at level L, i.e. 2^*^~^, 
is fixed as a constant. For a fixed A^, the time cost is not monotonic with 
respect to L. When L is too small the computational cost of D^^"* becomes 
dominant. When L is too large, the application of the partial representation 
C^^) + . . . + G'^^^ to a vector becomes expensive. From the perspective of 
time cost, there is an optimal Lopt for a fixed A^. We find that this optimal 
level number is the same for "H^, uniform "h} and 'h? algorithms. Table [1] 
shows that Lopt = 4 for A^ = 32, 64, 128, Lopt = 5 for A^ = 256, and Lopt = 6 
for A^ = 512. This suggests that for large A^, the optimal performance is 
achieved when the size of boxes on the final level L is 8 x 8. In other words, 
L = Ljv/ — 3. 

The memory cost per DOF for the "H^, the uniform "H^ and the 'h? al- 
gorithms is reported in Table HI The memory cost is estimated by summing 
the sizes of low-rank approximations as well as the size of D^^\ For a fixed 
A^, the memory cost generally decreases as L increases. This is because as L 
increases, an increasing part of the original dense matrix is represented using 
low-rank approximations. 

Both Table [T] and Table |2] indicate that uniform algorithm is signifi- 
cantly more advantageous than "h} algorithm, while the 'h? algorithm leads 
to a further improvement over the uniform 'H} algorithm especially for large 
A^. This fact can be better seen from Fig. |4] where the time and memory 
cost per DOF for A^ = 32, 64, 128, 256, 512 with optimal level number Lopt 
are shown. We remark that since the number of calls to the matvec subrou- 
tine are the same in all cases, the time cost difference comes solely from the 
efficiency difference of the low rank matrix- vector multiplication subroutines. 

We measure the accuracy for the IH}, the uniform "h} and the 'H? rep- 
resentations of G with its actual value using the operator norm (2-norm) of 
the error matrix. Here, the 2-norm of a matrix is numerically estimated by 
power method jof using several random initial guesses. We report both abso- 
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lute and relative errors. According to Table [31 the errors are well controlled 
with respect to both increasing and L, in spite of the more aggressive 
matrix compression strategy in the uniform T-L^ and the representations. 
Moreover, for each box X, the rank ru{e) of the uniform Ti^ representation 
is only slightly larger than the rank r{e) of the "H^ representation. This can 
be seen from Table ID Here the average rank for a level / is estimated by 
averaging the values of ru{e) (or r{e)) for all low-rank approximations at 
level /. Note that the rank of the "H^ representation is comparable to or even 
lower than the rank in the uniform "H^ representation. This is partially due 
to different weighting choices in the uniformization step and T-L^ construction 
step. 
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Ti^ time 
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number 
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0.0080 


0.0084 


64 


4 


3376 


0.0051 


0.0033 


0.0033 


64 


5 


4471 


0.0150 


0.0102 


0.0106 


128 


4 


4116 


0.0050 


0.0025 


0.0024 


128 


5 


4639 


0.0080 


0.0045 


0.0045 


128 


6 


5730 


0.0189 


0.0122 


0.0125 


256 


4 


7169 


0.015 


0.0054 


0.0050 


256 


5 


5407 


0.010 


0.0035 


0.0033 


256 


6 


5952 


0.013 


0.0058 


0.0057 


256 


7 


7021 


0.025 


0.0152 


0.0154 


512 


5 


8439 


0.025 


0.0070 


0.0063 


512 


6 


6708 


0.018 


0.0050 


0.0044 


512 


7 


7201 


0.022 


0.0079 


0.0072 



Tabic 1: matvec numbers and time cost per degree of freedom (DOF) for the H^, the 
uniform and the representations with different grid point per dimension N and low 
rank compression level L. The matvec numbers are by definition the same in the three 
algorithms. 
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N 

1 V 


T 

±J 


1-/^ YY^f^YY^rw^r 

1 i, LLL\^LLLKJL y 


T Tti 1 ^r\TYY\ 1-/ Tyi*3Tn oT^r 

liliIJi 111 / L liiv^liilji y 


"T-/^ YY~\ t^YY^ rw^r 
1 i, liidiiiji y 






npr DDF rMR^I 


npr DDF rMR^I 


npr DDF (MT^) 


32 


4 


0038 


0.0024 


0.0024 




A 




n nn97 






c; 
o 


U.UUOl 


n nn97 


n nn9fi 


128 


4 


0.0075 


0.0051 


0.0049 


128 


5 


0.0056 


0.0029 


0.0027 


128 


6 


0.0063 


0.0029 


0.0027 


256 


4 


0.0206 


0.0180 


0.0177 


256 


5 


0.0087 


0.0052 


0.0049 


256 


6 


0.0067 


0.0030 


0.0027 


256 


7 


0.0074 


0.0030 


0.0027 


512 


5 


0.0218 


0.0181 


0.0177 


512 


6 


0.0099 


0.0053 


0.0049 


512 


7 


0.0079 


0.0031 


0.0027 



Table 2: Memory cost per degree of freedom (DOF) for the H^, the uniform TL^ and the 
versions with different grid point per dimension N and low rank compression level L. 




Figure 4: Comparison of the time and memory costs for the H^, the uniform Ti^ and the 
versions with optimal level Lopt for TV = 32, 64, 128, 256, 512. The x-axis (N) is set to 
be in logarithmic scale. 
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N 


L 


n 


1 


Uniform 


n 


2 






Absolute 


Relative 


Absolute 


Relative 


Absolute 


Relative 






error 


error 


error 


error 


error 


error 


32 


4 


2.16e-07 


3.22e-07 


2.22e-07 


3.31e-07 


2.20e-07 


3.28e-07 


64 


4 


2.10e-07 


3.15e-07 


2.31e-07 


3.47e-07 


2.31e-07 


3.46e-07 


64 


5 


1.96e-07 


2.95e-07 


2.07e-07 


3.12e-07 


2.07e-07 


3.11e-07 


128 


4 


2.16e-07 


3.25e-07 


2.26e-07 


3.39e-07 


2.24e-07 


3.37e-07 


128 


5 


2.60e-07 


3.90e-07 


2.68e-07 


4.03e-07 


2.67e-07 


4.02e-07 


128 


6 


2.01e-07 


3.01e-07 


2.09e-07 


3.13e-07 


2.08e-07 


3.11e-07 


256 


4 


1.78e-07 


2.66e-07 


1.95e-07 


2.92e-07 


2.31e-07 


3.46e-07 


256 


5 


2.11e-07 


3.16e-07 


2.26e-07 


3.39e-07 


2.27e-07 


3.40e-07 


256 


6 


2.75e-07 


4.12e-07 


2.78e-07 


4.18e-07 


2.30e-07 


3.45e-07 


256 


7 


1.93e-07 


2.89e-07 


2.05e-07 


3.08e-07 


2.24e-07 


3.36e-07 


512 


5 


2.23e-07 


3.35e-07 


2.33e-07 


3.50e-07 


1.42e-07 


2.13e-07 


512 


6 


2.06e-07 


3.09e-07 


2.17e-07 


3.26e-07 


2.03e-07 


3.05e-07 


512 


7 


2.67e-07 


4.01e-07 


2.74e-07 


4.11e-07 


2.43e-07 


3.65e-07 



Tabic 3: Absolute and relative 2-norm errors for the H^, the uniform TL^ and the 
algorithms with different grid point per dimension N and low rank compression level L. 
The 2-norm is estimated using power method. 



/ 




Uniform 'H} 






average rank 


average rank 


average rank 


4 


6 


13 


13 


5 


6 


13 


11 


6 


6 


12 


9 



Table 4: Comparison of the average rank at different levels between the TH}, the uniform 
H\ and the algorithms, for N = 256. 
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The peeling algorithm for the construction of hierarchical matrix can 
be applied as well to general elliptic operators in divergence form H = 
— V • (a(r)V) + V{r). The computational domain, the grids are the same as 
the example above, and five-point discretization is used for the differential op- 
erator. The media is assumed to be high contrast: a{i,j) = 1 + U{i,j), with 
U{i,j) being independent random numbers uniformly distributed in [0,1]. 
The potential functions under consideration are (1) V{i,j) = 10~^W{i, j); 
(2) V{i,j) = 10~^W{i, j). W{i,j) are independent random numbers uni- 
formly distributed in [0,1] and are independent of U{i,j). We test the "H^ 
version for = 64, L = 4, with the compression criterion e = 10~^. The 
resulting absolute and relative error of the Green's function are reported 
in Table O The results indicate that the algorithms work well in these cases, 
despite the fact that the off-diagonal elements of the Green's function have a 
slower decay than the first example. We also remark that the small relative 
error for case (2) is due to the large 2-norm of when V is small. 



Potential 


Absolute error 


Relative error 




) = 10-W(i,j) 


5.91e-04 


2.97e-07 




) = 10-^W{t,j) 


3.60e-03 


1.81e-09 



Table 5: Absolute and relative 2-norin errors for the V,^ representation of the matrix 
(— V ■ (aV) + V)~^ with TV = 64, L = 4 and two choice of potential function V. The 
2-norm is estimated using power method. 

4. Conclusions and future work 

In this work, we present a novel algorithm for constructing a hierarchical 
matrix from its matrix-vector multiplication. One of the main motivations 
is the construction of the inverse matrix of the stiffness matrix of an elliptic 
differential operator. The proposed algorithm utilizes randomized singular 
value decomposition of low-rank matrices. The off-diagonal blocks of the 
hierarchical matrix are computed through a top-down peeling process. This 
algorithm is efficient. For an n x matrix, it uses only 0{logn) matrix- 
vector multiplications plus 0{nlogn) additional steps. The algorithm is also 
friendly to parallelization. The resulting hierarchical matrix representation 
can be used as a faster algorithm for matrix-vector multiplications, as well 
as for numerical homogenization or upscaling. 
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The performance of our algorithm is tested using two 2D eUiptic opera- 
tors. The T-L^, the uniform "H^ and the "H^ versions of the proposed algorithms 
are implemented. Numerical results show that our implementations are ef- 
ficient and accurate and that the uniform "H^ representation is significantly 
more advantageous over Ti^ representation in terms of both the time cost 
and the memory cost, and Ti^ representation leads to further improvement 
for large A^. 

Although the algorithms presented require only 0{logn) matvecs, the ac- 
tual number of matvecs can be quite large (for example, several thousands for 
the example in Section [3]). Therefore, the algorithms presented here might 
not be the right choice for many applications. However, for computational 
problems in which one needs to invert the same system with a huge of un- 
knowns or for homogenization problems where analytic approaches do not 
apply, our algorithm does provide an effective alternative. 

The current implementation depends explicitly on the geometric partition 
of the rectangular domain. However, the idea of our algorithm can be applied 
to general settings. For problems with unstructured grid, the only modifica- 
tion is to partition the unstructured grid with a quadtree structure and the 
algorithms essentially require no change. For discretizations of the boundary 
integral operators, the size of an interaction list is typically much smaller as 
many boxes contain no boundary points. Therefore, it is possible to design 
a more effective combination strategy with small number of matvecs. These 
algorithms can also be extended to the 3D cases in a straightforward way, 
however, we expect the constant to grow significantly. All these cases will be 
considered in the future. 
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